Abstract

We consider opportunistic communication over multiple channels where the state ("good" or "bad") 
of each channel evolves as independent and identically distributed Markov processes. A user, with limited 
channel sensing and access capability, chooses one channel to sense and subsequently access (based on 
the sensed channel state) in each time slot. A reward is obtained whenever the user senses and accesses 
a "good" channel. The objective is to design an optimal channel selection policy that maximizes the 
expected total (discounted or average) reward accrued over a finite or infinite horizon. This problem can 
be cast as a Partially Observable Markov Decision Process (POMDP) or a restless multi-armed bandit 
process, to which optimal solutions are often intractable. We show in this paper that a myopic policy 
that maximizes the immediate one-step reward is always optimal when the state transitions are positively 
correlated over time. When the state transitions are negatively correlated, we show that the same policy 
is optimal when the number of channels is limited to 2 or 3, while presenting a counterexample for 
the case of 4 channels. This result finds applications in opportunistic transmission scheduling in a 
fading environment, cognitive radio networks for spectrum overlay, and resource-constrained jamming 
and anti-jamming. 

Preliminary version of this work was presented at IEEE International Conference on Communications (ICC), May 2008, 
Beijing, China.
Index Terms

Opportunistic access, cognitive radio, POMDP, multi-armed bandit, restless bandit, Gittins index, 
Whittle's index, myopic policy.
I. Introduction

We consider a communication system in which a sender has access to multiple channels, but 
is limited to sensing and transmitting only on one at a given time. We explore how a smart 
sender should exploit past observations and the knowledge of the stochastic state evolution of 
these channels to maximize its transmission rate by switching opportunistically across channels. 

We model this problem in the following manner. As shown in Figure 1, there are n channels, 
each of which evolves as an independent, identically-distributed, two-state discrete-time Markov 
chain. The two states for each channel — "good" (or state 1) and "bad" (or state 0) — indicate the 
desirability of transmitting over that channel at a given time slot. The state transition probabilities 
are given by $p_{ij}$, $i, j = 0, 1$. In each time slot the sender picks one of the channels to sense
based on its prior observations, and obtains some fixed reward if it is in the good state. The basic 
objective of the sender is to maximize the reward that it can gain over a given finite time horizon. 
This problem can be described as a partially observable Markov decision process (POMDP) [1] 
since the states of the underlying Markov chains are not fully observed. It can also be cast as a 
special case of the class of restless multi-armed bandit problems [2]; more discussion on this is 
given in Section VII. 



Fig. 1. The Markov channel model: a two-state chain with states 0 ("bad") and 1 ("good") and transition probabilities $p_{ij}$.

This formulation is broadly applicable to several domains. It arises naturally in opportunistic 
spectrum access (OSA) [3], [4], where the sender is a secondary user, and the channel states
describe the occupancy by primary users. In the OSA problem, the secondary sender may send
on a given channel only when there is no primary user occupying it. It pertains to communication
over parallel fading channels as well, if a two-state Markovian fading model is employed. Another
interesting application of this formulation is in the domain of communication security, where it
can be used to develop bounds on the performance of resource-constrained jamming. A jammer
that has access to only one channel at a time could also use the same stochastic dynamic decision
making process to maximize the number of times that it can successfully jam communications
that occur on these channels. In this application, the "good" state for the jammer is precisely
when the channel is being utilized by other senders (in contrast with the OSA problem).

In this paper we examine the optimality of a simple myopic policy for the opportunistic access 
problem outlined above. Specifically, we show that the myopic policy is optimal for arbitrary
$n$ when $p_{11} \geq p_{01}$. We also show that it is optimal for $n = 3$ when $p_{11} < p_{01}$, while presenting
a finite horizon counterexample showing that it is in general not optimal for $n \geq 4$. We also
generalize these results to related formulations involving discounted and average rewards over 
an infinite horizon. 

These results extend and complement those reported in prior work [5]. Specifically, it has been 
shown in [5] that for all n the myopic policy has an elegant and robust structure that obviates the 
need to know the channel state transition probabilities and reduces channel selection to a simple 
round robin procedure. Based on this structure, the optimality of the myopic policy for n = 2 
was established and the performance of the myopic policy, in particular, the scaling property 
with respect to n, analyzed in [5]. It was conjectured in [5] that the myopic policy is optimal for 
any n. This conjecture was partially addressed in a preliminary conference version [6], where 
the optimality was established under certain restrictive conditions on the channel parameters and 
the discount factor. In the present paper, we significantly relax these conditions and formally
prove this conjecture under the condition $p_{11} \geq p_{01}$. We also provide a counterexample for
$p_{11} < p_{01}$.

We would like to emphasize that compared to earlier work [5], [6], the approach used 
in this paper relies on a coupling argument, which is the key to extending the optimality 
result to the arbitrary n case. Earlier techniques were largely based on exploiting the convex 
analytic properties of the value function, and were shown to have difficulty in overcoming the 
n = 2 barrier without further conditions on the discount factor or transition probabilities. This 
observation is somewhat reminiscent of the results reported in [7], where a coupling argument was 
also used to solve an n-queue problem while earlier versions [8] using value function properties 
were limited to a 2-queue case. We invite the interested reader to refer to [9], an important 
manuscript on monotonicity in MDPs which explores the power as well as the limitation of 
working with analytic properties of value functions and dynamic programming operators as we
had done in our earlier work. In particular, [9, Section 9.5] explores the difficulty of using such
techniques for multi-dimensional problems where the number of queues is more than n = 2; [9, 
Chapter 12] contrasts this proof technique with the stochastic coupling arguments, which our 
present work uses. 

The remainder of this paper is organized as follows. We formulate the problem in Section II 
and illustrate the myopic policy in Section III. In Section IV, we prove that the myopic policy
is optimal in the case of $p_{11} \geq p_{01}$, and show in Section V that it is in general not optimal
when this condition does not hold. Section VI extends the results from finite horizon to infinite
horizon. We discuss our work within the context of the class of restless bandit problems as well 
as some related work in this area in Section VII. Section VIII concludes the paper. 

II. Problem Formulation 

We consider the scenario where a user is trying to access the wireless spectrum to maximize 
its throughput or data rate. The spectrum consists of n independent and statistically identical 
channels. The state of a channel is given by a two-state discrete time Markov chain shown in 
Figure 1. 

The system operates in discrete time steps indexed by $t$, $t = 1, 2, \ldots, T$, where $T$ is the
time horizon of interest. At time $t^-$, the channels (i.e., the Markov chains representing them) go
through state transitions, and at time $t$ the user makes the channel sensing and access decision.
Specifically, at time t the user selects one of the n channels to sense, say channel i. If the 
channel is sensed to be in the "good" state (state 1), the user transmits and collects one unit of 
reward. Otherwise the user does not transmit (or transmits at a lower rate), collects no reward, 
and waits until t + 1 to make another choice. This process repeats sequentially until the time 
horizon expires. 
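
The slot-by-slot operation described above can be sketched as a small simulation. This is an illustrative sketch only, not the authors' code: the function names, the `policy(t, last)` interface, the 0-based channel indices, and the choice to start every channel from its stationary distribution are assumptions made here.

```python
import random


def step_channel(state, p01, p11, rng):
    """Advance one two-state Markov channel by one slot (1 = good, 0 = bad)."""
    p_good = p11 if state == 1 else p01
    return 1 if rng.random() < p_good else 0


def simulate(policy, n, p01, p11, horizon, seed=0):
    """Return the total (undiscounted) reward collected by `policy`.

    `policy(t, last)` is a hypothetical interface: it receives the slot index
    and the previous (channel, observation) pair, and returns a channel index.
    """
    rng = random.Random(seed)
    # Assumption: channels start from the stationary distribution
    # (requires p01 + p10 > 0, where p10 = 1 - p11).
    pi_good = p01 / (p01 + (1.0 - p11))
    states = [1 if rng.random() < pi_good else 0 for _ in range(n)]
    total, last = 0, None
    for t in range(1, horizon + 1):
        # State transitions happen just before the decision (time t^-).
        states = [step_channel(s, p01, p11, rng) for s in states]
        a = policy(t, last)
        obs = states[a]   # error-free sensing of the chosen channel only
        total += obs      # one unit of reward if the sensed channel is good
        last = (a, obs)
    return total
```

For example, `simulate(lambda t, last: 0, n=3, p01=0.2, p11=0.8, horizon=50)` evaluates the naive policy that always senses the first channel.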

As mentioned earlier, this abstraction is primarily motivated by the following multi-channel 
access scenario where a secondary user seeks spectrum opportunity in between a primary user's 
activities. Specifically, time is divided into frames and at the beginning of each frame there is 
a designated time slot for the primary user to reserve that frame and for secondary users to 
perform channel sensing. If a primary user intends to use a frame it will simply remain active in 
a channel (or multiple channels) during that sensing time slot (i.e., reservation is by default for a 
primary user in use of the channel), in which case a secondary user will find the channel(s) busy
and not attempt to use it for the duration of that frame. If the primary user is inactive during this
sensing time slot, then the remainder of the frame is open to secondary users. Such a structure 
provides the necessary protection for the primary user as channel sensing (in particular active 
channel sensing that involves communication between a pair of users) conducted at arbitrary 
times can cause undesirable interference. 

Within such a structure, a secondary user has a limited amount of time and capability to 
perform channel sensing, and may only be able to sense one or a subset of the channels before 
the sensing time slot ends. If all of these channels are unavailable, it must wait until
the next sensing time slot. In this paper we limit our attention to the special case where the
secondary user only has the resources to sense one channel within this slot. Conceptually our
formulation is easily extended to the case where the secondary user can sense multiple channels 
at a time within this structure, although the corresponding results differ, see e.g., [10]. 

Note that in this formulation we do not explicitly model the cost of channel sensing; it is 
implicit in the fact that the user is limited in how many channels it can sense at a time. Alternative 
formulations have been studied where sensing costs are explicitly taken into consideration in a 
user's sensing and access decision, see e.g., a sequential channel sensing scheme in [11]. 

In this formulation we have assumed that sensing errors are negligible. Techniques used in 
this paper may be applicable in proving the optimality of the myopic policy under imperfect 
sensing and for a general number of channels. The reason behind this is that our proof exploits 
the simple structure of the myopic policy, which remains when sensing is subject to errors as 
shown in [12]. 

Note that the system is not fully observable to the user, i.e., the user does not know the exact
state of the system when making the sensing decision. Specifically, channels go through state
transitions at time $t^-$ (or anytime in the interval $(t-1, t)$), thus when the user makes the channel
sensing decision at time $t$, it does not have the true state of the system at time $t$, which we
denote by $s(t) = [s_1(t), s_2(t), \ldots, s_n(t)] \in \{0, 1\}^n$. Furthermore, even after its action (at time
$t^+$) it only gets to observe the true state of one channel, which goes through another transition at
or before time $(t+1)^-$. The user's action space at time $t$ is given by the finite set $\{1, 2, \ldots, n\}$,
and we will use $a(t) = i$ to denote that the user selects channel $i$ to sense at time $t$. For clarity,
we will denote the outcome/observation of channel sensing at time $t$ following the action $a(t)$
by $h_{a(t)}(t)$, which is essentially the true state $s_{a(t)}(t)$ of channel $a(t)$ at time $t$ since we assume
channel sensing to be error-free.

It can be shown (see e.g., [1], [13], [14]) that a sufficient statistic of such a system for
optimal decision making, or the information state of the system [13], [14], is given by the
conditional probabilities of the state each channel is in given all past actions and observations.
Since each channel can be in one of two states, we denote this information state or belief vector
by $\bar{\omega}(t) = [\omega_1(t), \ldots, \omega_n(t)] \in [0, 1]^n$, where $\omega_i(t)$ is the conditional probability that channel
$i$ is in state 1 at time $t$ given all past states, actions and observations.$^1$ Throughout the paper
$\omega_i(t)$ will be referred to as the information state of channel $i$ at time $t$, or simply the channel
probability of $i$ at time $t$.

Due to the Markovian nature of the channel model, the future information state is only a 
function of the current information state and the current action; i.e., it is independent of past 
history given the current information state and action. It follows that the information state of 
the system evolves as follows. Given that the state at time $t$ is $\bar{\omega}(t)$ and action $a(t) = i$ is
taken, $\omega_i(t+1)$ can take on two values: (1) $p_{11}$ if the observation is that channel $i$ is in a
"good" state ($h_i(t) = 1$); this occurs with probability $P\{h_i(t) = 1 \mid \bar{\omega}(t)\} = \omega_i(t)$; (2) $p_{01}$ if
the observation is that channel $i$ is in a "bad" state ($h_i(t) = 0$); this occurs with probability
$P\{h_i(t) = 0 \mid \bar{\omega}(t)\} = 1 - \omega_i(t)$. For any other channel $j \neq i$, the corresponding $\omega_j(t+1)$ can
only take on one value (i.e., with probability 1): $\omega_j(t+1) = \tau(\omega_j(t))$, where the operator
$\tau : [0, 1] \to [0, 1]$ is defined as

$$\tau(\omega) := \omega p_{11} + (1 - \omega) p_{01}, \quad 0 \leq \omega \leq 1. \tag{1}$$

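These transition probabilities are summarized in the following equation for $t = 1, 2, \ldots, T-1$:

$$\{\omega_i(t+1) \mid \bar{\omega}(t), a(t)\} = \begin{cases} p_{11} & \text{with prob. } \omega_i(t) \text{ if } a(t) = i \\ p_{01} & \text{with prob. } 1 - \omega_i(t) \text{ if } a(t) = i \\ \tau(\omega_i(t)) & \text{with prob. } 1 \text{ if } a(t) \neq i \end{cases}, \quad i = 1, 2, \ldots, n. \tag{2}$$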
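
The update in (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `tau` and `update_belief` and the 0-based channel indexing are assumptions made here.

```python
def tau(w, p01, p11):
    """One-step belief propagation for an unobserved channel, as in (1)."""
    return w * p11 + (1.0 - w) * p01


def update_belief(belief, action, observation, p01, p11):
    """Return the next belief vector after sensing channel `action` (0-based).

    The sensed channel jumps to p11 (observed good) or p01 (observed bad);
    every other channel is propagated through tau, as in (2).
    """
    nxt = [tau(w, p01, p11) for w in belief]
    nxt[action] = p11 if observation == 1 else p01
    return nxt
```

With $p_{01} = 0.2$ and $p_{11} = 0.8$, sensing channel 0 of the belief `[0.5, 0.2]` and observing "good" moves that channel to $p_{11}$ and propagates the other through $\tau$.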

Also note that $\bar{\omega}(1) \in [0, 1]^n$ denotes the initial condition (information state) of the system,
which may be interpreted as the user's initial belief about how likely each channel is to be in the
good state before sensing starts at time $t = 1$. For the purpose of the optimization problems
formulated below, this initial condition is considered given, which can be any probability vector.$^2$

$^1$ Note that this is a standard way of turning a POMDP problem into a classic MDP (Markov decision process) problem by
means of information state, the main implication being that the state space is now uncountable.

It is important to note that although in general a POMDP problem has an uncountable
state space (information states are probability distributions), in our problem the state space
is countable for any given initial condition $\bar{\omega}(1)$. This is because, as shown above, the
information state of any channel with an initial probability of $\omega$ can only take on the values
$\{\omega, \tau^k(\omega), p_{01}, \tau^k(p_{01}), p_{11}, \tau^k(p_{11})\}$, where $k = 1, 2, \ldots$ and $\tau^k(\omega) := \tau(\tau^{k-1}(\omega))$, which is a
countable set.
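
The iterates of $\tau$ also admit a closed form, which makes the structure of this countable set explicit. The following short derivation follows from (1) alone; it assumes $p_{01} + p_{10} > 0$ (with $p_{10} = 1 - p_{11}$) so that the fixed point is well defined, and the symbol $\omega_o$ is introduced here only for illustration.

```latex
\tau(\omega) = p_{01} + (p_{11} - p_{01})\,\omega ,
\qquad
\omega_o := \frac{p_{01}}{1 - p_{11} + p_{01}}
\quad\text{satisfies}\quad \tau(\omega_o) = \omega_o ,
\\[4pt]
\tau^{k}(\omega) = \omega_o + (p_{11} - p_{01})^{k}\,(\omega - \omega_o),
\qquad k = 1, 2, \ldots
```

Under this assumption $\omega_o$ coincides with the steady-state probability of a channel being in state 1, and each of the three sequences in the set above is a geometric sequence around $\omega_o$ with ratio $p_{11} - p_{01}$.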

For compactness of presentation we will further use the operator $\mathcal{T}$ to denote the above
probability distribution of the information state (the entire vector):

$$\bar{\omega}(t+1) = \mathcal{T}(\bar{\omega}(t), a(t)), \tag{3}$$

by noting that the operation given in (2) is applied to $\bar{\omega}(t)$ element-by-element. We will also
use the following to denote the information state given the observation outcome:

$$\mathcal{T}(\bar{\omega}(t), a(t) \mid h_{a(t)}(t) = 1) = (\tau(\omega_1(t)), \ldots, \tau(\omega_{a(t)-1}(t)), p_{11}, \tau(\omega_{a(t)+1}(t)), \ldots, \tau(\omega_n(t))) \tag{4}$$
$$\mathcal{T}(\bar{\omega}(t), a(t) \mid h_{a(t)}(t) = 0) = (\tau(\omega_1(t)), \ldots, \tau(\omega_{a(t)-1}(t)), p_{01}, \tau(\omega_{a(t)+1}(t)), \ldots, \tau(\omega_n(t))) \tag{5}$$

The objective of the user is to maximize its total (discounted or average) expected reward over
a finite (or infinite) horizon. Let $J_T^{\pi}(\bar{\omega})$, $J_{\beta}^{\pi}(\bar{\omega})$, and $J_{\infty}^{\pi}(\bar{\omega})$ denote, respectively, these cost criteria
(namely, finite horizon, infinite horizon with discount, and infinite horizon average reward) under
policy $\pi$ starting in state $\bar{\omega} = [\omega_1, \ldots, \omega_n]$. The associated optimization problems ((P1)-(P3))
are formally defined as follows.

$$\text{(P1):} \quad \max_{\pi} J_T^{\pi}(\bar{\omega}) = \max_{\pi} E^{\pi}\Big[\sum_{t=1}^{T} \beta^{t-1} R_{\pi_t}(\bar{\omega}(t)) \,\Big|\, \bar{\omega}(1) = \bar{\omega}\Big]$$

$$\text{(P2):} \quad \max_{\pi} J_{\beta}^{\pi}(\bar{\omega}) = \max_{\pi} E^{\pi}\Big[\sum_{t=1}^{\infty} \beta^{t-1} R_{\pi_t}(\bar{\omega}(t)) \,\Big|\, \bar{\omega}(1) = \bar{\omega}\Big]$$

$$\text{(P3):} \quad \max_{\pi} J_{\infty}^{\pi}(\bar{\omega}) = \max_{\pi} \lim_{T \to \infty} \frac{1}{T} E^{\pi}\Big[\sum_{t=1}^{T} R_{\pi_t}(\bar{\omega}(t)) \,\Big|\, \bar{\omega}(1) = \bar{\omega}\Big]$$

where $\beta$ ($0 \leq \beta \leq 1$ for (P1) and $0 \leq \beta < 1$ for (P2)) is the discount factor, and $R_{\pi_t}(\bar{\omega}(t))$ is
the reward collected under state $\bar{\omega}(t)$ when channel $a(t) = \pi_t(\bar{\omega}(t))$ is selected and $h_{a(t)}(t)$ is
observed. This reward is given by $R_{\pi_t}(\bar{\omega}(t)) = 1$ with probability $\omega_{a(t)}(t)$ (when $h_{a(t)}(t) = 1$),
and 0 otherwise.

$^2$ That is, the optimal solutions are functions of the initial condition. A reasonable choice, if the user has no special information
other than the transition probabilities of these channels, is to simply use the steady-state probabilities of channels being in state
"1" as an initial condition (i.e., setting $\omega_i(1) = \frac{p_{01}}{p_{01} + p_{10}}$).

The maximization in (P1) is over the class of deterministic Markov policies.$^3$ An admissible
policy $\pi$, given by the vector $\pi = [\pi_1, \pi_2, \ldots, \pi_T]$, is thus such that $\pi_t$ specifies a mapping from
the current information state $\bar{\omega}(t)$ to a channel selection action $a(t) = \pi_t(\bar{\omega}(t)) \in \{1, 2, \ldots, n\}$.
This is done without loss of optimality due to the Markovian nature of the underlying system, and
due to known results on POMDPs. Note that the class of Markov policies in terms of information
state is also known as the class of separated policies (see [14]). Due to the finiteness of the (unobservable) state
space and the action space in problem (P1), it is known that an optimal policy (over all randomized
and deterministic, history-dependent and history-independent policies) may be found within the
class of separated (i.e., deterministic Markov) policies (see e.g., [14, Theorem 7.1, Chapter 6]),
thus justifying the maximization and the admissible policy space.

In Section VI we establish the existence of a stationary separated policy $\pi^*$, under which
the supremum of the expected discounted reward as well as the supremum of expected average
cost are achieved, hence justifying our use of maximization in (P2) and (P3). Furthermore, it is
shown that under this policy the limit in (P3) exists and is greater than the limsup of the average
performance of any other policy (in general history-dependent and randomized). This is a strong
notion of optimality; the interpretation is that the most "pessimistic" average performance under
policy $\pi^*$ ($\liminf_{T \to \infty} \frac{1}{T} J_T^{\pi^*}(\cdot) = \lim_{T \to \infty} \frac{1}{T} J_T^{\pi^*}(\cdot)$) is greater than the most "optimistic" performance
under any other policy $\pi$ ($\limsup_{T \to \infty} \frac{1}{T} J_T^{\pi}(\cdot)$). In much of the literature on MDP, this is referred to
as the strong optimality for an expected average cost (reward) problem; for a discussion on this,
see [15, Page 344].

$^3$ A Markov policy is a policy that derives its action only depending on the current (information) state, rather than the entire
history of states, see e.g., [14].

III. Optimal Policy and the Myopic Policy
A. Dynamic Programming Representations

Problems (P1)-(P3) defined in the previous section may be solved using their respective dynamic
programming (DP) representations. Specifically, for problem (P1), we have the following
recursive equations:

$$\begin{aligned}
V_T(\bar{\omega}) &= \max_{a=1,2,\ldots,n} E[R_a(\bar{\omega})] \\
V_t(\bar{\omega}) &= \max_{a=1,2,\ldots,n} E[R_a(\bar{\omega}) + \beta V_{t+1}(\mathcal{T}(\bar{\omega}, a))] \\
&= \max_{a=1,\ldots,n} \big(\omega_a + \beta \omega_a V_{t+1}(\mathcal{T}(\bar{\omega}, a \mid 1)) + \beta (1 - \omega_a) V_{t+1}(\mathcal{T}(\bar{\omega}, a \mid 0))\big),
\end{aligned} \tag{6}$$

for $t = 1, 2, \ldots, T-1$, where $V_t(\bar{\omega})$ is known as the value function, or the maximum expected
future reward that can be accrued starting from time $t$ when the information state is $\bar{\omega}$. In
particular, we have $V_1(\bar{\omega}) = \max_{\pi} J_T^{\pi}(\bar{\omega})$, and an optimal deterministic Markov policy exists
such that $a = \pi_t^*(\bar{\omega})$ achieves the maximum in (6) (see e.g., [15, Chapter 4]). Note that since $\mathcal{T}$
is a conditional probability distribution (given in (3)), $V_{t+1}(\mathcal{T}(\bar{\omega}, a))$ is taken to be the expectation
over this distribution when its argument is $\mathcal{T}$, with a slight abuse of notation, as expressed in
(6).
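
For small instances, recursion (6) can be evaluated directly by brute force. The sketch below is an illustration under stated assumptions, not the authors' code: the names `make_value_function`, `value`, and `q_value` are hypothetical, beliefs are passed as tuples so they can be memoized, and channels are 0-based. The running time grows exponentially in $T$, which is consistent with the remark that solving the DP equations directly is computationally heavy.

```python
from functools import lru_cache


def make_value_function(p01, p11, beta, T):
    """Build V_t and the per-action values of recursion (6) for horizon T."""

    def tau(w):
        return w * p11 + (1.0 - w) * p01

    def q_value(t, belief, a):
        """Expected reward of sensing channel `a` at time t, then acting optimally."""
        w = belief[a]
        if t == T:
            return w  # terminal step: only the immediate expected reward
        nxt = [tau(x) for x in belief]
        good = tuple(nxt[:a] + [p11] + nxt[a + 1:])  # T(belief, a | 1)
        bad = tuple(nxt[:a] + [p01] + nxt[a + 1:])   # T(belief, a | 0)
        return w + beta * (w * value(t + 1, good) + (1.0 - w) * value(t + 1, bad))

    @lru_cache(maxsize=None)
    def value(t, belief):
        """V_t(belief): maximum over all channels of the action value."""
        return max(q_value(t, belief, a) for a in range(len(belief)))

    return value, q_value
```

For instance, `value, q_value = make_value_function(0.2, 0.8, 1.0, 2)` followed by `value(1, (0.3, 0.6))` evaluates a two-slot problem with two channels, and `q_value(1, (0.3, 0.6), a)` lets one compare individual actions.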

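Similar dynamic programming representations hold for (P2) and (P3) as given below. For
problem (P2) there exists a unique function $V_{\beta}(\cdot)$ satisfying the following fixed point equation:

$$\begin{aligned}
V_{\beta}(\bar{\omega}) &= \max_{a=1,\ldots,n} E[R_a(\bar{\omega}) + \beta V_{\beta}(\mathcal{T}(\bar{\omega}, a))] \\
&= \max_{a=1,\ldots,n} \big(\omega_a + \beta \omega_a V_{\beta}(\mathcal{T}(\bar{\omega}, a \mid 1)) + \beta (1 - \omega_a) V_{\beta}(\mathcal{T}(\bar{\omega}, a \mid 0))\big).
\end{aligned} \tag{7}$$

We have that $V_{\beta}(\bar{\omega}) = \max_{\pi} J_{\beta}^{\pi}(\bar{\omega})$, and that a stationary separated policy $\pi^*$ is optimal if and
only if $a = \pi^*(\bar{\omega})$ achieves the maximum in (7) [16, Theorem 7.1].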

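For problem (P3), we will show that there exist a bounded function $h_{\infty}(\cdot)$ and a constant
scalar $J$ satisfying the following equation:

$$\begin{aligned}
J + h_{\infty}(\bar{\omega}) &= \max_{a=1,2,\ldots,n} E[R_a(\bar{\omega}) + h_{\infty}(\mathcal{T}(\bar{\omega}, a))] \\
&= \max_{a=1,\ldots,n} \big(\omega_a + \omega_a h_{\infty}(\mathcal{T}(\bar{\omega}, a \mid 1)) + (1 - \omega_a) h_{\infty}(\mathcal{T}(\bar{\omega}, a \mid 0))\big).
\end{aligned} \tag{8}$$

The boundedness of $h_{\infty}$ and of the immediate reward implies that $J = \max_{\pi} J_{\infty}^{\pi}(\bar{\omega})$, and that a
stationary separated policy $\pi^*$ is optimal in the context of (P3) if and only if $a = \pi^*(\bar{\omega})$ achieves
the maximum in (8) [16, Theorems 6.1-6.3].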

Solving (P1)-(P3) using the above recursive equations is in general computationally heavy. 
Therefore, instead of directly using the DP equations, the focus of this paper is on examining
the optimality properties of a simple, greedy algorithm. We define this algorithm next and show
its simplicity in structure and implementation. 

B. The Myopic Policy 

A myopic or greedy policy ignores the impact of the current action on the future reward, focusing
solely on maximizing the expected immediate reward. Myopic policies are thus stationary.
For (P1), the myopic policy under state $\bar{\omega} = [\omega_1, \omega_2, \ldots, \omega_n]$ is given by

$$a^*(\bar{\omega}) = \arg\max_{a=1,\ldots,n} E[R_a(\bar{\omega})] = \arg\max_{a=1,\ldots,n} \omega_a. \tag{9}$$

In general, obtaining the myopic action in each time slot requires the successive update of 
the information state as given in (2), which explicitly relies on the knowledge of the transition 
probabilities $\{p_{ij}\}$ as well as the initial condition $\bar{\omega}(1)$. Interestingly, it has been shown in [5]
that the implementation of the myopic policy requires only the knowledge of the initial condition
and the order of $p_{11}$ and $p_{01}$, but not the precise values of these transition probabilities. To make
the present paper self-contained, below we briefly describe how this policy works; more details 
may be found in [5]. 

Specifically, when $p_{11} \geq p_{01}$ the conditional probability updating function $\tau(\omega)$ is a monotonically
increasing function, i.e., $\tau(\omega_1) \geq \tau(\omega_2)$ for $\omega_1 \geq \omega_2$. Therefore the ordering of information
states among channels is preserved when they are not observed. If a channel has been observed to
be in state "1" (respectively "0"), its probability at the next step becomes $p_{11} \geq \tau(\omega)$ (respectively
$p_{01} \leq \tau(\omega)$) for any $\omega \in [0, 1]$. In other words, a channel observed to be in state "1" (respectively
"0") will have the highest (respectively lowest) possible information state among all channels.

These observations lead to the following implementation of the myopic policy. We take the
initial information state $\bar{\omega}(1)$, order the channels according to their probabilities $\omega_i(1)$, and probe
the highest one (top of the ordered list) with ties broken randomly. In subsequent steps we stay
in the same channel if the channel was sensed to be in state "1" (good) in the previous slot;
otherwise, this channel is moved to the bottom of the ordered list, and we probe the channel
currently at the top of the list. This in effect creates a round robin style of probing, where the
channels are cycled through in a fixed order. This circular structure is exploited in Section IV
to prove the optimality of the myopic policy in the case of $p_{11} \geq p_{01}$.
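
The round robin procedure described above can be sketched with a double-ended queue. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, channels are 0-based, and ties in the initial ordering are broken by channel index rather than randomly, purely to keep the example deterministic.

```python
from collections import deque


def myopic_order_positive(initial_belief):
    """Channel indices sorted by decreasing initial belief (head is probed first)."""
    return deque(sorted(range(len(initial_belief)),
                        key=lambda i: -initial_belief[i]))


def myopic_step_positive(order, observation):
    """Update the ordered list after sensing order[0]; return the next channel.

    Observed good (1): stay on the same channel.
    Observed bad (0): move that channel to the bottom of the list.
    """
    if observation == 0:
        order.rotate(-1)  # head of the list goes to the bottom
    return order[0]
```

Note that neither function takes $p_{01}$ or $p_{11}$ as an argument, which reflects the point that only the initial condition and the order of the two transition probabilities are needed.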
When $p_{11} < p_{01}$, we have an analogous but opposite situation. The conditional probability
updating function $\tau(\omega)$ is now a monotonically decreasing function, i.e., $\tau(\omega_1) \leq \tau(\omega_2)$ for
$\omega_1 \geq \omega_2$. Therefore the ordering of information states among channels is reversed at each time
step when they are not observed. If a channel has been observed to be in state "1" (respectively
"0"), its probability at the next step becomes $p_{11} \leq \tau(\omega)$ (respectively $p_{01} \geq \tau(\omega)$) for any
$\omega \in [0, 1]$. In other words, a channel observed to be in state "1" (respectively "0") will have the
lowest (respectively highest) possible information state among all channels.

As in the previous case, these observations lead to the following implementation. We
take the initial information state $\bar{\omega}(1)$, order the channels according to their probabilities $\omega_i(1)$,
and probe the highest one (top of the ordered list) with ties broken randomly. In each subsequent
step, if the channel sensed in the previous step was in state "0" (bad), we keep this channel
at the top of the list but completely reverse the order of the remaining list, and we probe this
channel. If the channel sensed in the previous step was in state "1" (good), then we completely
reverse the order of the entire list (including dropping this channel to the bottom of the list), and
probe the channel currently at the top of the list. This alternating circular structure is exploited
in Section V to examine the optimality of the myopic policy in the case of $p_{11} < p_{01}$.
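
The list manipulation for the case $p_{11} < p_{01}$ can be sketched in the same spirit. Again this is only an illustration: the function name is hypothetical, and `order` is assumed to be a plain Python list of 0-based channel indices sorted by decreasing belief, with `order[0]` the channel that was just sensed.

```python
def myopic_step_negative(order, observation):
    """Return the new ordered list after sensing order[0] when p11 < p01.

    Observed bad (0): keep the sensed channel on top, reverse the rest.
    Observed good (1): reverse the entire list, so the sensed channel
    drops to the bottom. The next channel to probe is the new head.
    """
    head, rest = order[0], order[1:]
    if observation == 0:
        return [head] + rest[::-1]
    return order[::-1]
```

As a hand check with $p_{01} = 0.8$, $p_{11} = 0.2$ and beliefs $(0.3, 0.7, 0.5)$: the ordered list is `[1, 2, 0]`; after channel 1 is observed good the updated beliefs are $(0.62, 0.2, 0.5)$, which matches the reversed list `[0, 2, 1]`.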

IV. Optimality of the Myopic Policy in the Case of $p_{11} \geq p_{01}$

In this section we show that the myopic policy, with a simple and robust structure, is optimal
when $p_{11} \geq p_{01}$. We will first show this for the finite horizon discounted cost case, and then
extend the result to the infinite horizon case under both discounted and average cost criteria in
Section VI.

The main assumption is formally stated as follows. 

Assumption 1: The transition probabilities $p_{01}$ and $p_{11}$ are such that

$$p_{11} \geq p_{01}. \tag{10}$$

The main theorem of this section is as follows.

Theorem 1: Consider Problem (P1). Define $V_t(\bar{\omega}; a) := E[R_a(\bar{\omega}) + \beta V_{t+1}(\mathcal{T}(\bar{\omega}, a))]$, i.e., the
value of the value function given in Eqn (6) when action $a$ is taken at time $t$ followed by an
optimal policy. Under Assumption 1, the myopic policy is optimal, i.e., for all $t$, $1 \leq t < T$, and
all $\bar{\omega} = [\omega_1, \ldots, \omega_n] \in [0, 1]^n$,

$$V_t(\bar{\omega}; a = j) - V_t(\bar{\omega}; a = i) \geq 0, \tag{11}$$

if $\omega_j \geq \omega_i$, for $i = 1, \ldots, n$.

The proof of this theorem is based on backward induction on $t$: given the optimality of the
myopic policy at times $t+1, t+2, \ldots, T$, we want to show that it is also optimal at time $t$.
This relies on a number of lemmas introduced below. The first lemma introduces a notation that
allows us to express the expected future reward under the myopic policy.

Lemma 1: There exist $T$ $n$-variable functions, denoted by $W_t()$, $t = 1, 2, \ldots, T$, each of
which is a polynomial of order 1$^4$ and can be represented recursively in the following form:

$$W_t(\bar{\omega}) = \omega_n + \omega_n \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-1}), p_{11}) + (1 - \omega_n) \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-1})), \tag{12}$$

where $\bar{\omega} = [\omega_1, \omega_2, \ldots, \omega_n]$ and $W_T(\bar{\omega}) = \omega_n$.

Proof: The proof is easily obtained using backward induction on $t$ given the above recursive
equation and noting that $W_T()$ is one such polynomial and the mapping $\tau()$ is an affine operation.

■
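
Recursion (12) translates almost line by line into code. The sketch below is an illustrative evaluation of $W_t$, assuming beliefs are passed as tuples ordered as in the lemma (the last entry is the channel probed first); the function name `W` and its argument order are choices made here, not notation from the paper beyond the symbol itself.

```python
def W(t, w, p01, p11, beta, T):
    """Evaluate W_t(w) from (12) for a belief tuple w = (w_1, ..., w_n)."""
    w_n = w[-1]
    if t == T:
        return w_n  # W_T(w) = w_n
    # Unobserved channels are propagated through tau, as in (1).
    shifted = tuple(x * p11 + (1.0 - x) * p01 for x in w[:-1])
    good = shifted + (p11,)   # probed channel seen good: it stays last
    bad = (p01,) + shifted    # probed channel seen bad: it moves to the front
    return w_n + beta * (w_n * W(t + 1, good, p01, p11, beta, T)
                         + (1.0 - w_n) * W(t + 1, bad, p01, p11, beta, T))
```

For example, with $p_{01} = 0.2$, $p_{11} = 0.8$, $\beta = 1$ and $T = 2$, `W(1, (0.3, 0.6), 0.2, 0.8, 1.0, 2)` gives the two-slot reward of probing the channel with belief 0.6 first and then following the round robin order.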

Corollary 1: When $\bar{\omega}$ represents the ordered list of information states $[\omega_1, \omega_2, \ldots, \omega_n]$ with
$\omega_1 \leq \omega_2 \leq \cdots \leq \omega_n$, then $W_t(\bar{\omega})$ is the expected total reward obtained by the myopic policy
from time $t$ on.

This result follows directly from the description of the policy given in Section III-B.

Proposition 1: The fact that $W_t$ is a polynomial of order 1 and affine in each of its elements
implies that

$$W_t(\omega_1, \ldots, \omega_{n-2}, y, x) - W_t(\omega_1, \ldots, \omega_{n-2}, x, y) = (x - y)\left[W_t(\omega_1, \ldots, \omega_{n-2}, 0, 1) - W_t(\omega_1, \ldots, \omega_{n-2}, 1, 0)\right]. \tag{13}$$

Similar results hold when we change the positions of $x$ and $y$.

To see this, consider $W_t(\omega_1, \ldots, \omega_{n-2}, x, y)$ and $W_t(\omega_1, \ldots, \omega_{n-2}, y, x)$, as functions of $x$ and
$y$, each having an $x$ term, a $y$ term, an $xy$ term and a constant term. Since we are just swapping
the positions of $x$ and $y$ in these two functions, the constant term remains the same, and so
does the $xy$ term. Thus the only difference is the $x$ term and the $y$ term, as given in the above
equation. This linearity result will be used later in our proofs.
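
The argument above can be written out explicitly. In the following sketch the coefficients $A$, $B$, $C$, $D$ are names introduced here for illustration; they depend on $\omega_1, \ldots, \omega_{n-2}$ and $t$ but not on $x$ or $y$.

```latex
W_t(\omega_1,\ldots,\omega_{n-2},x,y) = A + Bx + Cy + Dxy,
\qquad
W_t(\omega_1,\ldots,\omega_{n-2},y,x) = A + By + Cx + Dxy,
\\[4pt]
W_t(\ldots,y,x) - W_t(\ldots,x,y) = (C - B)(x - y),
\\[4pt]
W_t(\ldots,0,1) - W_t(\ldots,1,0) = (A + C) - (A + B) = C - B .
```

Substituting the last line into the one before it gives (13).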

The next lemma establishes a necessary and sufficient condition for the optimality of the 
myopic policy. 

$^4$ Each function $W_t$ is affine in each variable, when all other variables are held constant.

Lemma 2: Consider Problem (P1) and Assumption 1. Given the optimality of the myopic
policy at times $t+1, t+2, \ldots, T$, the optimality at time $t$ is equivalent to:

$$W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \leq W_t(\omega_1, \ldots, \omega_n), \quad \text{for all } \omega_1 \leq \cdots \leq \omega_i \leq \cdots \leq \omega_n.$$

Proof: Since the myopic policy is optimal from $t+1$ on, it is sufficient to show that probing
$\omega_n$ followed by myopic probing is better than probing any other channel followed by myopic
probing. The former is precisely given by the RHS of the above equation; the latter by the LHS,
thus completing the proof. ■
Having established that $W_t(\bar{\omega})$ is the total expected reward of the myopic policy for an
increasingly ordered vector $\bar{\omega} = [\omega_1, \ldots, \omega_n]$, we next proceed to show that we do not decrease
this total expected reward $W_t(\bar{\omega})$ by switching the order of two neighboring elements $\omega_i$ and
$\omega_{i+1}$ if $\omega_i \geq \omega_{i+1}$. This is done in two separate cases, when $i + 1 < n$ (given in Lemma 4) and
when $i + 1 = n$ (given in Lemma 5), respectively. The first case is quite straightforward, while
proving the second case turned out to be significantly more difficult. Our proof of the second
case (Lemma 5) relies on a separate lemma (Lemma 3) that establishes a bound between the
greedy use of two identical vectors but with a different starting position. The proof of Lemma 3
is based on a coupling argument and is quite instructive. Below we present and prove Lemmas
3, 4 and 5.

Lemma 3: For $0 \leq \omega_1 \leq \omega_2 \leq \ldots \leq \omega_n \leq 1$, we have the following inequality for all
$t = 1, 2, \ldots, T$:

$$1 + W_t(\omega_2, \ldots, \omega_n, \omega_1) \geq W_t(\omega_1, \ldots, \omega_n). \tag{14}$$

Proof: We prove this lemma using a coupling argument along any sample path. The LHS
of the above inequality represents the expected reward of a policy (referred to as L below) that
probes in the sequence of channels 1 followed by $n, n-1, \ldots$, and then 1 again, and so on,
plus an extra reward of 1; the RHS represents the expected reward of a policy (referred to as
R below) that probes in the sequence of channels $n$ followed by $n-1, \ldots$, and 1 and then $n$
again, and so on. It helps to imagine lining up the $n$ channels along a circle in the sequence of
$n, n-1, \ldots, 1$, clockwise, and thus L's starting position is 1, R's starting position is $n$, exactly
one spot ahead of L clockwise. Each will cycle around the circle till time $T$.

Now for any realization of the channel conditions (or any sample path of the system), consider
the sequence of "0"s and "1"s that these two policies see, and consider the positions they are at on
the circle. The reward a policy gets along a given sample path is $R_l = \sum_{j=t}^{T} \beta^{j_l}$ for policy L,
where $j_l = j$ if L sees a "1" at time $j$, and the term contributes no reward otherwise; the reward for R is
$R_r = \sum_{j=t}^{T} \beta^{j_r}$ with $j_r$ similarly defined. There are two cases.

Case (1): the two eventually catch up with each other at some time K < T, i.e., at some 
point they start probing exactly the same channel. From this point on the two policies behave 
exactly the same way along the same sample path, and the reward they obtain from this point 
on is exactly the same. Therefore in this case we only need to compare the rewards (L has an 
extra 1) leading up to this point. 

Case (2): The two never manage to meet within the horizon T. In this case we need to compare 
the rewards for the entire horizon (from t to T). 

We will consider Case (1) first. There are only two possibilities for the two policies to meet:
(Case 1.a) either L has seen exactly one more "0" than R in its sequence, or (Case 1.b) R has
seen exactly $n - 1$ more "0"s than L. This is because the moment a policy sees a "0" it moves to
the next channel on the circle. L is only one position behind R, so one more "0" will put it at
exactly the same position as R. The same holds for R moving $n - 1$ more positions ahead to catch
up with L.

Case (1.a): L sees exactly one more "0" than R in its sequence. The extra "0" necessarily occurs
at exactly time $K$, $t \le K \le T$, meaning that at $K$, L sees a "0" and R sees a "1". From $t$ to $K$,
if we write the sequence of rewards (zeros and ones) under L and R, we observe the following:
between $t$ and $K - 1$ both L and R have an equal number of zeros, while for $t' = t, t+1, \ldots, K-1$,
the number of zeros up to time $t'$ is no more for L than for R. In other words, L
and R see the same number of "0"s, but L's always lag behind (or occur no earlier). That is,
for every "0" R sees, L has a matching "0" that occurs no earlier than R's "0". This means
that if we denote by $R_l(t_1, t_2)$ the rewards accumulated between $t_1$ and $t_2$, then for the rewards
in $[t, K-1]$, we have $R_l(t, t') \ge R_r(t, t')$ for all $t' \le K - 1$, while $R_l(K, K) = 0$ and
$R_r(K, K) = \beta^K$. Finally by definition we have $R_l(K+1, T) = R_r(K+1, T)$. Therefore overall
we have $1 + R_l(t, T) \ge R_r(t, T)$, proving the above inequality.

Case (1.b): R sees $n - 1$ more "0"s than L does. The comparison is simpler. We only need to
note that R's "0"s must again precede (or be no later than) L's since otherwise we will return
to Case (1.a). Therefore we have $R_l \ge R_r$, and thus $1 + R_l \ge R_r$ is also true.

We now consider Case (2). The argument is essentially the same. In this case the two do not
get to meet, but they are on their way, meaning that either L has exactly the same number of "0"s as R
and their positions are no earlier (corresponding to Case (1.a)), or R has more "0"s than L (but
not up to $n - 1$) and their positions are no later than L's (corresponding to Case (1.b)). So either
way we have $1 + R_l \ge R_r$.

The proof is thus complete. ■ 
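The inequality (14) can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the recursion for $W_t$ used in the proofs of Lemmas 4 and 5 (probe the last entry; a "1" places $p_{11}$ at the back, a "0" places $p_{01}$ in front), with $\tau(\omega) = \omega p_{11} + (1-\omega) p_{01}$. The function names `tau`, `W` and `lemma3_gap`, the argument `slots` (number of remaining slots, so `slots = 1` corresponds to $t = T$) and the sampled parameter values are illustrative assumptions.

```python
import random


def tau(w, p11, p01):
    """Belief that a channel is good in the next slot, given current belief w."""
    return w * p11 + (1.0 - w) * p01


def W(omega, slots, p11, p01, beta):
    """Expected discounted reward over `slots` remaining slots when the last
    entry of `omega` is probed first and the vector is then reordered as in
    the recursion for W_t: a "1" puts p11 at the back, a "0" puts p01 in front."""
    last = omega[-1]
    if slots == 1:  # corresponds to t = T
        return last
    rest = [tau(w, p11, p01) for w in omega[:-1]]
    after_one = W(rest + [p11], slots - 1, p11, p01, beta)
    after_zero = W([p01] + rest, slots - 1, p11, p01, beta)
    return last * (1.0 + beta * after_one) + (1.0 - last) * beta * after_zero


def lemma3_gap(omega, slots, p11, p01, beta):
    """Return 1 + W(w2, ..., wn, w1) - W(w1, ..., wn), which (14) says is >= 0."""
    rotated = omega[1:] + omega[:1]
    return 1.0 + W(rotated, slots, p11, p01, beta) - W(omega, slots, p11, p01, beta)


if __name__ == "__main__":
    random.seed(0)  # arbitrary seed, used only for reproducibility
    worst = min(
        lemma3_gap(sorted(random.random() for _ in range(4)), 6, 0.8, 0.3, 0.9)
        for _ in range(200)
    )
    print("smallest gap over 200 random sorted belief vectors:", worst)
```

The recursion branches twice per slot, so this brute-force evaluation is only practical for short horizons; it is meant as a sanity check of the statement rather than a replacement for the coupling argument.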

Lemma 4: For all $j$, $1 \le j \le n - 3$, and all $x \ge y$, we have

$$W_t(\omega_1, \ldots, \omega_j, x, y, \ldots, \omega_n) \le W_t(\omega_1, \ldots, \omega_j, y, x, \ldots, \omega_n). \tag{15}$$

Proof: We prove this by induction over $t$. The claim is obviously true for $t = T$, since both
sides will be equal to $\omega_n$, thereby establishing the induction basis. Now suppose the claim is
true for all $t + 1, \ldots, T - 1$. We have

$$\begin{aligned}
W_t(\omega_1, \ldots, \omega_{j-1}, x, y, \ldots, \omega_n)
&= \omega_n \bigl(1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(x), \tau(y), \ldots, \tau(\omega_{n-1}), p_{11})\bigr) \\
&\quad + (1 - \omega_n)\, \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(x), \tau(y), \ldots, \tau(\omega_{n-1})) \\
&\le \omega_n \bigl(1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(y), \tau(x), \ldots, \tau(\omega_{n-1}), p_{11})\bigr) \\
&\quad + (1 - \omega_n)\, \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(y), \tau(x), \ldots, \tau(\omega_{n-1})) \\
&= W_t(\omega_1, \ldots, \omega_{j-1}, y, x, \ldots, \omega_n),
\end{aligned} \tag{16}$$

where the inequality is due to the induction hypothesis, and noting that $\tau(\cdot)$ is a monotone
increasing mapping in the case of $p_{11} \ge p_{01}$. ■
Lemma 5: For all $x \ge y$, we have

$$W_t(\omega_1, \ldots, \omega_{n-2}, x, y) \le W_t(\omega_1, \ldots, \omega_{n-2}, y, x). \tag{17}$$

Proof: This lemma is proved inductively. The claim is obviously true for $t = T$. Assume it
also holds for times $t + 1, \ldots, T - 1$. We have by the definition of $W_t(\cdot)$ and due to its linearity
property:

$$\begin{aligned}
& W_t(\omega_1, \ldots, \omega_{n-2}, y, x) - W_t(\omega_1, \ldots, \omega_{n-2}, x, y) \\
&\quad = (x - y)\bigl(W_t(\omega_1, \ldots, \omega_{n-2}, 0, 1) - W_t(\omega_1, \ldots, \omega_{n-2}, 1, 0)\bigr) \\
&\quad = (x - y)\bigl(1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{01}, p_{11}) - \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11})\bigr).
\end{aligned}$$

But from the induction hypothesis we know that

$$W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{01}, p_{11}) \ge W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}, p_{01}). \tag{18}$$

This means that

$$\begin{aligned}
& 1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{01}, p_{11}) - \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}) \\
&\quad \ge 1 + \beta W_{t+1}(\tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}, p_{01}) - \beta W_{t+1}(p_{01}, \tau(\omega_1), \ldots, \tau(\omega_{n-2}), p_{11}) \ge 0,
\end{aligned}$$

where the last inequality is due to Lemma 3 (note that in that lemma we proved $1 + A \ge B$,
which obviously implies $1 + \beta A \ge \beta B$ for $0 \le \beta \le 1$ that is used above). This, together with
the condition $x \ge y$, completes the proof. ■
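The linearity identity used in the proof of Lemma 5, and the swap inequalities (15) and (17), lend themselves to a quick numerical check. The sketch below is illustrative only: it re-implements the recursion for $W_t$ as it appears in (16), and the helper names `swap_gain` and `linearity_rhs`, the 0-based position `k` and the parameter values are assumptions introduced here.

```python
def tau(w, p11, p01):
    """Belief that a channel is good in the next slot, given current belief w."""
    return w * p11 + (1.0 - w) * p01


def W(omega, slots, p11, p01, beta):
    """W_t evaluated with `slots` remaining slots (slots = 1 means t = T)."""
    last = omega[-1]
    if slots == 1:
        return last
    rest = [tau(w, p11, p01) for w in omega[:-1]]
    after_one = W(rest + [p11], slots - 1, p11, p01, beta)
    after_zero = W([p01] + rest, slots - 1, p11, p01, beta)
    return last * (1.0 + beta * after_one) + (1.0 - last) * beta * after_zero


def swap_gain(omega, k, slots, p11, p01, beta):
    """W after exchanging entries k and k+1 (0-based) minus W of `omega` itself."""
    swapped = list(omega)
    swapped[k], swapped[k + 1] = swapped[k + 1], swapped[k]
    return W(swapped, slots, p11, p01, beta) - W(list(omega), slots, p11, p01, beta)


def linearity_rhs(prefix, x, y, slots, p11, p01, beta):
    """(x - y) * (W(prefix, 0, 1) - W(prefix, 1, 0)), as in the proof of Lemma 5."""
    w01 = W(list(prefix) + [0.0, 1.0], slots, p11, p01, beta)
    w10 = W(list(prefix) + [1.0, 0.0], slots, p11, p01, beta)
    return (x - y) * (w01 - w10)


if __name__ == "__main__":
    prefix, x, y = [0.3, 0.5], 0.8, 0.4          # illustrative beliefs with x >= y
    lhs = swap_gain(prefix + [x, y], 2, 4, 0.7, 0.2, 0.9)
    rhs = linearity_rhs(prefix, x, y, 4, 0.7, 0.2, 0.9)
    print("W(.., y, x) - W(.., x, y) =", lhs, " linearity right-hand side =", rhs)
```

A non-negative `swap_gain` for a pair with `omega[k] >= omega[k + 1]` is what Lemmas 4 and 5 assert when $p_{11} \ge p_{01}$.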
We are now ready to prove the main theorem. 

Proof of Theorem 1: The basic approach is by induction on $t$. The optimality of the myopic
policy at time $t = T$ is obvious. So the induction basis is established. Now assume that the
myopic policy is optimal for all times $t+1, t+2, \ldots, T-1$, and we will show that it is also
optimal at time $t$. By Lemma 2 this is equivalent to establishing the following

$$W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \le W_t(\omega_1, \ldots, \omega_n). \tag{19}$$

But we know from Lemmas 4 and 5 that

$$\begin{aligned}
W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i)
&\le W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_i, \omega_n) \\
&\le W_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_i, \omega_{n-1}, \omega_n) \le \cdots \le W_t(\omega_1, \ldots, \omega_n),
\end{aligned}$$

where the first inequality is the result of Lemma 5, while the remaining inequalities are repeated 
application of Lemma 4, completing the proof. ■ 
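For small instances, the statement of Theorem 1 can be compared against an exhaustive dynamic program. The following sketch is an illustration under stated assumptions rather than the method of the paper: it assumes the belief update in which the probed channel's belief becomes $p_{11}$ or $p_{01}$ according to the observation while every other belief $\omega$ becomes $\tau(\omega)$, and the names `next_belief`, `value`, `slots` and the `myopic` flag are hypothetical.

```python
import random


def tau(w, p11, p01):
    """Belief that a channel is good in the next slot, given current belief w."""
    return w * p11 + (1.0 - w) * p01


def next_belief(omega, i, seen_good, p11, p01):
    """Belief vector after probing channel i and observing its state."""
    return tuple(
        (p11 if seen_good else p01) if j == i else tau(w, p11, p01)
        for j, w in enumerate(omega)
    )


def value(omega, slots, p11, p01, beta, myopic):
    """Expected discounted reward over `slots` remaining slots.

    myopic=True  : always probe a channel with the largest current belief.
    myopic=False : maximise over all channels in every slot (exhaustive DP).
    """
    if myopic:
        candidates = [max(range(len(omega)), key=lambda j: omega[j])]
    else:
        candidates = range(len(omega))
    best = float("-inf")
    for i in candidates:
        total = omega[i]
        if slots > 1:
            hit = value(next_belief(omega, i, True, p11, p01),
                        slots - 1, p11, p01, beta, myopic)
            miss = value(next_belief(omega, i, False, p11, p01),
                         slots - 1, p11, p01, beta, myopic)
            total += beta * (omega[i] * hit + (1.0 - omega[i]) * miss)
        best = max(best, total)
    return best


if __name__ == "__main__":
    random.seed(1)  # arbitrary seed, used only for reproducibility
    omega = tuple(random.random() for _ in range(3))
    p11, p01, beta = 0.8, 0.3, 0.9   # positively correlated: p11 >= p01
    print("exhaustive:", value(omega, 5, p11, p01, beta, myopic=False))
    print("myopic:    ", value(omega, 5, p11, p01, beta, myopic=True))
```

The exhaustive search visits on the order of $(2n)^{T-t}$ belief vectors, so it is useful only as a cross-check on toy problems.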
We would like to emphasize that from a technical point of view, Lemma 3 is the key to 
the whole proof: it leads to Lemma 5, which in turn leads to Theorem 1. While Lemma 5 
was easy to conceptualize as a sufficient condition to prove the main theorem, Lemma 3 was 
much more elusive to construct and prove. This, indeed, marks the main difference between the 
proof techniques used here vs. that used in our earlier work [6]: Lemma 3 relies on a coupling 
argument instead of the convex analytic properties of the value function. 
V. The Case of $p_{11} < p_{01}$

In the previous section we showed that a myopic policy is optimal if $p_{11} \ge p_{01}$. In this section
we examine what happens when $p_{11} < p_{01}$, which corresponds to the case when the Markovian
channel state process exhibits a negative auto-correlation over a unit time. This is perhaps a case
of less practical interest and relevance. However, as we shall see, this case presents a greater
degree of technical complexity and richness than the previous case. Specifically, we first show
that when the number of channels is three ($n = 3$) or when the discount factor $\beta \le 1/2$, the
myopic policy remains optimal even for the case of $p_{11} < p_{01}$ (the proof for two channels in this
case was given earlier in [5]). We thus conclude that the myopic policy is optimal for $n \le 3$ or
$\beta \le 1/2$ regardless of the transition probabilities. We then present a counterexample showing
that the myopic policy is not optimal in general when $n \ge 4$ and $\beta > 1/2$. In particular, our
counterexample is for a finite horizon with $n = 4$ and $\beta = 1$.

A. $n = 3$ or $\beta \le 1/2$

We start by developing some results parallel to those presented in the previous section for the
case of $p_{11} \ge p_{01}$.

Lemma 6: There exist $T$ $n$-variable polynomial functions of order 1, denoted by $Z_t(\cdot)$, $t =
1, 2, \ldots, T$, i.e., each function is linear in all the elements, and can be represented recursively
in the following form:

$$Z_t(\bar{\omega}) := \omega_n \bigl(1 + \beta Z_{t+1}(p_{11}, \tau(\omega_{n-1}), \ldots, \tau(\omega_1))\bigr) + (1 - \omega_n)\, \beta Z_{t+1}(\tau(\omega_{n-1}), \ldots, \tau(\omega_1), p_{01}), \tag{20}$$

where $Z_T(\bar{\omega}) = \omega_n$.

Corollary 2: $Z_t(\bar{\omega})$ given in (20) represents the expected total reward of the myopic policy
when $\bar{\omega}$ is ordered in increasing order of $\omega_i$.

Similar to Corollary 1, the above result follows directly from the policy description given in 
Section III-B. 

It follows that the function $Z_t$ also has the same linearity property presented earlier, i.e.

$$Z_t(\omega_1, \ldots, \omega_{n-2}, y, x) - Z_t(\omega_1, \ldots, \omega_{n-2}, x, y) = (x - y)\bigl(Z_t(\omega_1, \ldots, \omega_{n-2}, 0, 1) - Z_t(\omega_1, \ldots, \omega_{n-2}, 1, 0)\bigr). \tag{21}$$

Similar results hold when we change the positions of $x$ and $y$.

In the next lemma and theorem we prove that the myopic policy is still optimal when $p_{11} < p_{01}$
if $n = 3$ or $\beta \le 1/2$. In particular, Lemma 7 below is the analog of Lemmas 4 and 5 combined.

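Lemma 7: At time $t$ ($t = 1, 2, \ldots, T$), for all $j \le n - 2$, we have the following inequality
for all $1 \ge x \ge y \ge 0$ if either $n = 3$ or $\beta \le 1/2$:

$$Z_t(\omega_1, \ldots, \omega_j, y, x, \omega_{j+3}, \ldots, \omega_n) \ge Z_t(\omega_1, \ldots, \omega_j, x, y, \omega_{j+3}, \ldots, \omega_n). \tag{22}$$

Proof: We prove this by induction on $t$. The claim is obviously true for $t = T$. Now suppose
it is true for $t + 1, \ldots, T - 1$. Due to the linearity property of $Z_t$,

$$\begin{aligned}
& Z_t(\omega_1, \ldots, \omega_j, y, x, \omega_{j+3}, \ldots, \omega_n) - Z_t(\omega_1, \ldots, \omega_j, x, y, \omega_{j+3}, \ldots, \omega_n) \\
&\quad = (x - y)\bigl(Z_t(\omega_1, \ldots, \omega_j, 0, 1, \omega_{j+3}, \ldots, \omega_n) - Z_t(\omega_1, \ldots, \omega_j, 1, 0, \omega_{j+3}, \ldots, \omega_n)\bigr).
\end{aligned} \tag{23}$$

Thus it suffices to show that $Z_t(\omega_1, \ldots, \omega_j, 0, 1, \omega_{j+3}, \ldots, \omega_n) \ge Z_t(\omega_1, \ldots, \omega_j, 1, 0, \omega_{j+3}, \ldots, \omega_n)$.

We treat the cases $j < n - 2$ and $j = n - 2$ separately. For $j < n - 2$, without loss of generality,
let $j = n - 3$ (the proof follows exactly for all $j < n - 3$ with lengthier notation). At time
$t$ we have

$$\begin{aligned}
& Z_t(\omega_1, \ldots, \omega_{n-3}, 0, 1, \omega_n) - Z_t(\omega_1, \ldots, \omega_{n-3}, 1, 0, \omega_n) \\
&\quad = \omega_n \beta \bigl(Z_{t+1}(p_{11}, p_{11}, p_{01}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1)) - Z_{t+1}(p_{11}, p_{01}, p_{11}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1))\bigr) \\
&\qquad + (1 - \omega_n) \beta \bigl(Z_{t+1}(p_{11}, p_{01}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1), p_{01}) - Z_{t+1}(p_{01}, p_{11}, \tau(\omega_{n-3}), \ldots, \tau(\omega_1), p_{01})\bigr) \\
&\quad \ge 0,
\end{aligned}$$

where the last inequality is due to the induction hypothesis.

Now we will consider the case when $j = n - 2$.

$$\begin{aligned}
& Z_t(\omega_1, \ldots, \omega_{n-2}, 0, 1) - Z_t(\omega_1, \ldots, \omega_{n-2}, 1, 0) \\
&\quad = 1 + \beta Z_{t+1}(p_{11}, p_{01}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1)) - \beta Z_{t+1}(p_{11}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1), p_{01}).
\end{aligned} \tag{24}$$

Next we show that if $\beta \le 1/2$ or $n = 3$ the right hand side of (24) is non-negative.
If $\beta \le 1/2$, then

$$1 + \beta Z_{t+1}(p_{11}, p_{01}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1)) - \beta Z_{t+1}(p_{11}, \tau(\omega_{n-2}), \ldots, \tau(\omega_1), p_{01}) \ge 0,$$

since the first $Z_{t+1}$ term is non-negative and the second is at most $1/(1-\beta)$, so that the left-hand side is at least $1 - \beta/(1-\beta) \ge 0$.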
If $n = 3$, then

$$\begin{aligned}
& 1 + \beta Z_{t+1}(p_{11}, p_{01}, \tau(\omega_1)) - \beta Z_{t+1}(p_{11}, \tau(\omega_1), p_{01}) \\
&\quad = 1 + \beta (\tau(\omega_1) - p_{01})\bigl(Z_{t+1}(p_{11}, 0, 1) - Z_{t+1}(p_{11}, 1, 0)\bigr) \\
&\quad \ge 1 - \beta \bigl(Z_{t+1}(p_{11}, 0, 1) - Z_{t+1}(p_{11}, 1, 0)\bigr) \\
&\quad \ge 0,
\end{aligned}$$

where the first inequality is due to the fact that $-1 \le \tau(\omega_1) - p_{01} \le 0$ and the last inequality is
given by the induction hypothesis. ■
Theorem 2: Consider Problem (P1). Assume that $p_{11} < p_{01}$. The myopic policy is optimal
for the case of $n = 3$ and the case of $\beta \le 1/2$ with arbitrary $n$. More precisely, for these two
cases, $\forall t$, $1 \le t \le T$, we have

$$V_t(\bar{\omega}; a = j) - V_t(\bar{\omega}; a = i) \ge 0, \tag{25}$$

if $\omega_j \ge \omega_i$ for $i = 1, \ldots, n$.

Proof: We prove by induction on $t$. The optimality of the myopic policy at time $t = T$ is
obvious. Now assume that the myopic policy is optimal for all times $t+1, t+2, \ldots, T-1$, and
we want to show that it is also optimal at time $t$. Suppose at time $t$ the channel probabilities
are such that $\omega_n \ge \omega_i$ for $i = 1, \ldots, n-1$. The myopic policy is optimal at time $t$ if and only
if probing $\omega_n$ followed by myopic probing is better than probing any other channel followed by
myopic probing. Mathematically, this means

$$Z_t(\omega_1, \ldots, \omega_{i-1}, \omega_{i+1}, \ldots, \omega_n, \omega_i) \le Z_t(\omega_1, \ldots, \omega_n), \quad \text{for all } \omega_1 \le \omega_i \le \omega_n.$$

But this is a direct consequence of Lemma 7, completing the proof. ■ 

B. A 4-channel Counter Example 

The following example shows that the myopic policy is not, in general, optimal for $n \ge 4$
when $p_{11} < p_{01}$.

Example 1: Consider an example with the following parameters: $p_{01} = 0.9$, $p_{11} = 0.1$, $\beta =
1$, and $\bar{\omega} = [.97, .97, .98, .99]$. Now compare the following two policies at time $T - 3$: play
myopically (I), or play the .98 channel first, followed by the myopic policy (II). Computation
reveals that

$$V^{I}_{T-3}(.97, .97, .98, .99) = 2.401863 < V^{II}_{T-3}(.97, .97, .98, .99) = 2.402968,$$

which shows that the myopic policy is not optimal in this case. 
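The two values in Example 1 can be reproduced directly from the recursion (20). The sketch below is one way to carry out the computation and is not taken from the original text: policy (II) is represented, as in the proof of Theorem 2, by placing the .98 belief last so that it is probed first, and the names `Z`, `SLOTS`, `value_myopic` and `value_alternative` are illustrative assumptions.

```python
def tau(w, p11, p01):
    """Belief that a channel is good in the next slot, given current belief w."""
    return w * p11 + (1.0 - w) * p01


def Z(omega, slots, p11, p01, beta):
    """Recursion (20) with `slots` remaining slots (slots = 1 means t = T).

    The last entry is probed; the unobserved beliefs are updated and listed in
    reverse order, because tau is decreasing when p11 < p01."""
    last = omega[-1]
    if slots == 1:
        return last
    rest = [tau(w, p11, p01) for w in reversed(omega[:-1])]
    after_one = Z([p11] + rest, slots - 1, p11, p01, beta)
    after_zero = Z(rest + [p01], slots - 1, p11, p01, beta)
    return last * (1.0 + beta * after_one) + (1.0 - last) * beta * after_zero


P11, P01, BETA = 0.1, 0.9, 1.0   # parameters of Example 1
SLOTS = 4                        # slots T-3, T-2, T-1 and T

# Policy (I): beliefs in increasing order, so the .99 channel is probed first.
value_myopic = Z([0.97, 0.97, 0.98, 0.99], SLOTS, P11, P01, BETA)
# Policy (II): the .98 channel is placed last and therefore probed first.
value_alternative = Z([0.97, 0.97, 0.99, 0.98], SLOTS, P11, P01, BETA)

print(f"{value_myopic:.6f} {value_alternative:.6f}")  # prints 2.401863 2.402968
```

The gap of roughly $10^{-3}$ comes entirely from the branch in which the first probe fails: probing .98 first leaves the .99 channel unobserved, which changes the belief carried into the later slots.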

It remains an interesting question whether such counterexamples exist when
the initial condition is such that every channel is in the good state with the stationary probability.

VI. Infinite Horizon 

Now we consider extensions of results in Sections IV and V to (P2) and (P3), i.e., to show 
that the myopic policy is also optimal for (P2) and (P3) under the same conditions. Intuitively, 
this holds due to the fact that the stationary optimal policy of the finite horizon problem is 
independent of the horizon as well as the discount factor. Theorems 3 and 4 below concretely 
establish this. 

We point out that the proofs of Theorems 3 and 4 do not rely on any additional assumptions
other than the optimality of the myopic policy for (P1). Indeed, if the optimality of the
myopic policy for (P1) can be established under weaker conditions, Theorems 3 and 4 can be
readily invoked to establish its optimality under the same weaker conditions for (P2) and (P3),
respectively.

Theorem 3: If the myopic policy is optimal for (P1), it is also optimal for (P2) for $0 \le \beta < 1$.
Furthermore, its value function is the limiting value function of (P1) as the time horizon goes
to infinity, i.e., we have $\max_{\pi} J_\beta^{\pi}(\bar{\omega}) = \lim_{T \to \infty} \max_{\pi} J_T^{\pi}(\bar{\omega})$.

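Proof: We first use the bounded convergence theorem (BCT) to establish the fact that under
any deterministic stationary Markov policy $\pi$, we have $J_\beta^{\pi}(\bar{\omega}) = \lim_{T \to \infty} J_T^{\pi}(\bar{\omega})$. We prove this
by noting that

$$\begin{aligned}
J_\beta^{\pi}(\bar{\omega}) &= E^{\pi}\Bigl[\lim_{T \to \infty} \sum_{t=1}^{T} \beta^{t-1} R_{\pi(t)}(\bar{\omega}(t)) \,\Big|\, \bar{\omega}(1) = \bar{\omega}\Bigr] \\
&= \lim_{T \to \infty} E^{\pi}\Bigl[\sum_{t=1}^{T} \beta^{t-1} R_{\pi(t)}(\bar{\omega}(t)) \,\Big|\, \bar{\omega}(1) = \bar{\omega}\Bigr] \\
&= \lim_{T \to \infty} J_T^{\pi}(\bar{\omega}),
\end{aligned} \tag{26}$$

where the second equality is due to BCT for $\sum_{t=1}^{T} \beta^{t-1} R_{\pi(t)}(\bar{\omega}(t)) \le \frac{1}{1-\beta}$. This proves the
second part of the theorem by noting that due to the finiteness of the action space, we can
interchange maximization and limit.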

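Let $\pi^*$ denote the myopic policy. We now establish the optimality of $\pi^*$ for (P2). From
Theorem 1, we know:

$$J_T^{\pi^*}(\bar{\omega}) = \max_{a=i} \bigl\{ \omega_i + \beta \omega_i J_{T-1}^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 1)) + \beta (1 - \omega_i) J_{T-1}^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) \bigr\}.$$

Taking the limit of both sides, we have

$$J_\beta^{\pi^*}(\bar{\omega}) = \max_{a=i} \bigl\{ \omega_i + \beta \omega_i J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 1)) + \beta (1 - \omega_i) J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) \bigr\}. \tag{27}$$

Note that (27) is nothing but the dynamic programming equation for the infinite horizon
discounted reward problem given in (7). From the uniqueness of the dynamic programming solution,
then, we have

$$J_\beta^{\pi^*}(\bar{\omega}) = V_\beta(\bar{\omega}) = \max_{\pi} J_\beta^{\pi}(\bar{\omega}),$$

hence, the optimality of the myopic policy.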

■ 
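The limit in Theorem 3 can be observed numerically for the myopic policy. The sketch below is illustrative and rests on assumptions: the belief update is the one in which the probed channel's belief becomes $p_{11}$ or $p_{01}$ and every other belief $\omega$ becomes $\tau(\omega)$, and the function names and parameter values are hypothetical. Since the myopic policy does not depend on the horizon and each slot's reward lies in $[0, 1]$, extending the horizon from $T$ to $T+1$ adds a term between $0$ and $\beta^{T}$, so the finite-horizon values increase to a limit bounded by $1/(1-\beta)$.

```python
def tau(w, p11, p01):
    """Belief that a channel is good in the next slot, given current belief w."""
    return w * p11 + (1.0 - w) * p01


def next_belief(omega, i, seen_good, p11, p01):
    """Belief vector after probing channel i and observing its state."""
    return tuple(
        (p11 if seen_good else p01) if j == i else tau(w, p11, p01)
        for j, w in enumerate(omega)
    )


def myopic_value(omega, slots, p11, p01, beta):
    """Expected discounted reward of the myopic policy over `slots` slots."""
    if slots == 0:
        return 0.0
    i = max(range(len(omega)), key=lambda j: omega[j])  # largest belief
    hit = myopic_value(next_belief(omega, i, True, p11, p01),
                       slots - 1, p11, p01, beta)
    miss = myopic_value(next_belief(omega, i, False, p11, p01),
                        slots - 1, p11, p01, beta)
    return omega[i] + beta * (omega[i] * hit + (1.0 - omega[i]) * miss)


if __name__ == "__main__":
    omega, p11, p01, beta = (0.3, 0.6, 0.5), 0.8, 0.2, 0.8  # illustrative values
    previous = 0.0
    for horizon in range(1, 13):
        current = myopic_value(omega, horizon, p11, p01, beta)
        print(horizon, round(current, 6), "increment:", round(current - previous, 6))
        previous = current
```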

Theorem 4: Consider (P3) with the expected average reward and under the ergodicity assumption
$|p_{11} - p_{00}| < 1$. The myopic policy is optimal for problem (P3) if it is optimal for (P1).

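Proof: We consider the infinite horizon discounted reward for $\beta < 1$ under the optimal policy
denoted by $\pi^*$:

$$J_\beta^{\pi^*}(\bar{\omega}) = \max_{a=i} \bigl\{ \omega_i + \beta \omega_i J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 1)) + \beta (1 - \omega_i) J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) \bigr\}. \tag{28}$$

This can be written as

$$(1 - \beta) J_\beta^{\pi^*}(\bar{\omega}) = \max_{a=i} \bigl\{ \omega_i + \beta \omega_i \bigl[J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 1)) - J_\beta^{\pi^*}(\bar{\omega})\bigr] + \beta (1 - \omega_i) \bigl[J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) - J_\beta^{\pi^*}(\bar{\omega})\bigr] \bigr\}.$$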

May 6, 2008 DRAFT



Notice that the boundedness of the reward function and compactness of the information state
imply that the sequence $\{(1 - \beta) J_\beta^{\pi^*}(\bar{\omega})\}$ is bounded, i.e. for all $0 \le \beta < 1$,

$$(1 - \beta) J_\beta^{\pi^*}(\bar{\omega}) \le 1. \tag{29}$$

Also, applying Lemma 2 from [6] (which provides an upper bound on the difference in value
functions between taking two different actions followed by the optimal policy) and noting that
$-1 < p_{11} - p_{00} < 1$, we have that there exists some positive constant $K := \frac{1}{1 - |p_{11} - p_{01}|}$ such that

$$\bigl| J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) - J_\beta^{\pi^*}(\bar{\omega}) \bigr| \le K. \tag{30}$$

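By the Bolzano-Weierstrass theorem, (29) and (30) guarantee the existence of a converging
sequence $\beta_k \to 1$ such that

$$\lim_{k \to \infty} (1 - \beta_k) J_{\beta_k}^{\pi^*}(\bar{\omega}^*) := J^*, \tag{31}$$

and

$$\lim_{k \to \infty} \bigl[ J_{\beta_k}^{\pi^*}(\bar{\omega}) - J_{\beta_k}^{\pi^*}(\bar{\omega}^*) \bigr] := h^{\pi^*}(\bar{\omega}), \tag{32}$$

where $\omega_i^* := \frac{p_{01}}{1 - p_{11} + p_{01}}$ is the steady-state belief (the limiting belief when channel $i$ is not sensed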
for a long time). 
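The steady-state belief is the fixed point of the one-step update $\tau(\omega) = \omega p_{11} + (1 - \omega) p_{01}$, and because $\tau$ is affine with slope $p_{11} - p_{01}$, the distance to the fixed point shrinks by a factor $|p_{11} - p_{01}|$ per unobserved slot. A minimal sketch follows; the function names and the numerical values are illustrative assumptions.

```python
def tau(w, p11, p01):
    """Belief that a channel is good in the next slot, given current belief w."""
    return w * p11 + (1.0 - w) * p01


def unobserved_belief(w, k, p11, p01):
    """Belief after the channel has gone unobserved for k consecutive slots."""
    for _ in range(k):
        w = tau(w, p11, p01)
    return w


def stationary_belief(p11, p01):
    """Fixed point of tau, i.e. p01 / (1 - p11 + p01)."""
    return p01 / (1.0 - p11 + p01)


if __name__ == "__main__":
    p11, p01 = 0.8, 0.3  # illustrative transition probabilities
    target = stationary_belief(p11, p01)
    for k in (0, 1, 2, 5, 10, 20):
        belief = unobserved_belief(1.0, k, p11, p01)
        print(k, belief, "distance to steady state:", abs(belief - target))
```

The geometric decay of this distance is also what makes the constant $K = 1/(1 - |p_{11} - p_{01}|)$ in (30) finite under the ergodicity assumption.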

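As a result, (31) can be written as

$$J^* = \lim_{k \to \infty} \bigl\{ (1 - \beta_k) J_{\beta_k}^{\pi^*}(\bar{\omega}^*) + (1 - \beta_k) \bigl[ J_{\beta_k}^{\pi^*}(\bar{\omega}) - J_{\beta_k}^{\pi^*}(\bar{\omega}^*) \bigr] \bigr\}.$$

In other words,

$$J^* = \lim_{k \to \infty} \max_{a=i} \bigl\{ \omega_i + \beta_k \omega_i \bigl[ J_{\beta_k}^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 1)) - J_{\beta_k}^{\pi^*}(\bar{\omega}) \bigr] + \beta_k (1 - \omega_i) \bigl[ J_{\beta_k}^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) - J_{\beta_k}^{\pi^*}(\bar{\omega}) \bigr] \bigr\}.$$

From (32), we can write this as

$$J^* + h^{\pi^*}(\bar{\omega}) = \max_{a=i} \bigl\{ \omega_i + \omega_i h^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 1)) + (1 - \omega_i) h^{\pi^*}(\mathcal{T}(\bar{\omega}, i | 0)) \bigr\}. \tag{33}$$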

Note that (33) is nothing but the DP equation as given by (8). In addition, we know that the
immediate reward as well as the function $h$ are both bounded by $\max(1, K)$. This implies that $J^*$
is the maximum average reward, i.e. $J^* = \max_{\pi} J^{\pi}(\bar{\omega}(1))$ (see [16, Theorems 6.1-6.3]).
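On the other hand, we know from Theorem 3 that the myopic policy is optimal for (P2) if it
is for (P1), and thus we can take $\pi^*$ in (28) to be the myopic policy. Rewriting (28) gives the
following:

$$J_\beta^{\pi^*}(\bar{\omega}) = \omega_{\pi^*(\bar{\omega})} + \beta \omega_{\pi^*(\bar{\omega})} J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, \pi^*(\bar{\omega}) | 1)) + \beta (1 - \omega_{\pi^*(\bar{\omega})}) J_\beta^{\pi^*}(\mathcal{T}(\bar{\omega}, \pi^*(\bar{\omega}) | 0)).$$

Repeating steps (31)-(33) we arrive at the following:

$$J^* + h^{\pi^*}(\bar{\omega}) = \omega_{\pi^*(\bar{\omega})} + \omega_{\pi^*(\bar{\omega})} h^{\pi^*}(\mathcal{T}(\bar{\omega}, \pi^*(\bar{\omega}) | 1)) + (1 - \omega_{\pi^*(\bar{\omega})}) h^{\pi^*}(\mathcal{T}(\bar{\omega}, \pi^*(\bar{\omega}) | 0)), \tag{34}$$

which shows that $(J^*, h^{\pi^*}, \pi^*)$ is a canonical triplet [16, Theorem 6.2]. This, together with
boundedness of $h^{\pi^*}$ and the immediate reward, implies that the myopic policy $\pi^*$ is optimal for
(P3) [16, Theorem 6.3]. ■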

VII. Discussion and Related Work 

The problem studied in this paper may be viewed as a special case of a class of MDPs known 
as the restless bandit problems [2]. In this class of problems, N controlled Markov chains (also 
called projects or machines) are activated (or played) one at a time. A machine when activated 
generates a state dependent reward and transits to the next state according to a Markov rule. A 
machine not activated transits to the next state according to a (potentially different) Markov rule. 
The problem is to decide the sequence in which these machines are activated so as to maximize 
the expected (discounted or average) reward over an infinite horizon. To put our problem in this 
context, each channel corresponds to a machine, and a channel is activated when it is probed, and 
its information state goes through a transition depending on the observation and the underlying 
channel model. When a channel is not probed, its information state goes through a transition 
solely based on the underlying channel model^5.

^5 The standard definition of bandit problems typically assumes finite or countably infinite state spaces. While our problem
can potentially have an uncountable state space, it is nevertheless countable for a given initial state. This view has been taken
throughout the paper.

In the case that a machine stays frozen in its current state when not played, the problem
reduces to the multi-armed bandit problem, a class of problems solved by Gittins in his 1970
seminal work [17]. Gittins showed that there exists an index associated with each machine
that is solely a function of that individual machine and its state, and that playing the machine 
currently with the highest index is optimal. This index has since been referred to as the Gittins 
index due to Whittle [18]. The remarkable nature of this result lies in the fact that it essentially 
decomposes the $N$-dimensional problem into $N$ 1-dimensional problems, as an index is defined
for a machine independent of others. The basic model of multi-armed bandit has been used 
previously in the context of channel access and cognitive radio networks. For example, in [19], 
Bayesian learning was used to estimate the probability of a channel being available, and the 
Gittins indices, calculated based on such estimates (which were only updated when a channel 
is observed and used, thus giving rise to a multi-armed bandit formulation rather than a restless 
bandit formulation), were used for channel selection. 

On the other hand, relatively little is known about the structure of the optimal policies for 
the restless bandit problems in general. It has been shown that the Gittins index policy is not 
in general optimal in this case [2], and that this class of problems is PSPACE-hard in general 
[20]. Whittle, in [2], proposed a Gittins-like index (referred to as the Whittle's index policy), 
shown to be optimal under a constraint on the average number of machines that can be played 
at a given time, and asymptotically optimal under certain limiting regimes [21]. There has been 
a large volume of literature in this area, including various approximation algorithms, see for 
example [22] and [23] for near-optimal heuristics, as well as conditions for certain policies to 
be optimal for special cases of the restless bandit problem, see e.g., [24], [25]. The nature of 
the results derived in the present paper is similar to that of [24], [25] in spirit. That is, we have 
shown that for this special case of the restless bandit problem an index policy is optimal under 
certain conditions. For the indexability (as defined by Whittle [2]) of this problem, see [26]. 

Recently Guha and Munagala [27], [28] studied a class of problems referred to as the feedback 
multi-armed bandit problems. This class is very similar to the restless bandit problem studied 
in the present paper, with the difference that channels may have different transition probabilities 
(thus this is a slight generalization to the one studied here). While we identified conditions 
under which a simple greedy index policy is optimal in the present paper, Guha and Munagala 
in [27], [28] looked for provably good approximation algorithms. In particular, they derived a 
$(2 + \epsilon)$-approximate policy using a duality-based technique.
VIII. Conclusion

The general problem of opportunistic sensing and access arises in many multi-channel communication
contexts. For cases where the stochastic evolution of channels can be modelled as
i.i.d. two-state Markov chains, we showed that a simple and robust myopic policy is optimal for 
the finite and infinite horizon discounted reward criteria as well as the infinite horizon average 
reward criterion, when the state transitions are positively correlated over time. When the state 
transitions are negatively correlated, we showed that the same policy is optimal when the number 
of channels is limited to 2 or 3, and presented a counterexample for the case of 4 channels. 